While locations of 2020, 2024, and 2028 Summer Olympics are decided, (Tokyo, Paris, and Los Angeles in order), 2032’s location is not decided and its bidding process has not begun yet. Delhi is very serious in entering the bid, and in preparation, its committee wants to build a model that can predict weather to assess its feasibility.
Baseline climatological information includes temperature, relative humidity, wind speed and direction, precipitation, and visibility. Several additional climatological elements for the summer games can include lightning, heat index, UV index, sea breeze, and ocean currents. While most climatological data is available from existing observation stations usually. These are guidelines from World Meteorological Organization. https://www.wmo.int/pages/prog/amp/pwsp/documents/PWS_Olympic_Guideline_Draft-Ver3.pdf
Although its specific timing is unknown, as Summer Olympics took places between April and October in the past, being able to predict temperature and humidity of the city of New Delhi in such time frame will help plan the bid.
There is a dataset available for past 20+ years weather on hourly basis for New Delhi, India. https://www.kaggle.com/mahirkukreja/delhi-weather-data/downloads/delhi-weather-data.zip/2
About the dataset: 100990 rows(observations) and 20 columns(variables)
## [1] 100990
## [1] 20
## [1] "datetime_utc" "X_conds" "X_dewptm" "X_fog"
## [5] "X_hail" "X_heatindexm" "X_hum" "X_precipm"
## [9] "X_pressurem" "X_rain" "X_snow" "X_tempm"
## [13] "X_thunder" "X_tornado" "X_vism" "X_wdird"
## [17] "X_wdire" "X_wgustm" "X_windchillm" "X_wspdm"
## datetime_utc X_conds X_dewptm X_fog
## Length:100990 Length:100990 Min. :-24.00 Min. :0.00000
## Class :character Class :character 1st Qu.: 10.00 1st Qu.:0.00000
## Mode :character Mode :character Median : 15.00 Median :0.00000
## Mean : 15.74 Mean :0.06969
## 3rd Qu.: 22.00 3rd Qu.:0.00000
## Max. : 75.00 Max. :1.00000
## NA's :621
## X_hail X_heatindexm X_hum X_precipm
## Min. :0.0000000 Min. :26.80 Length:100990 Mode:logical
## 1st Qu.:0.0000000 1st Qu.:31.70 Class :character NA's:100990
## Median :0.0000000 Median :35.10 Mode :character
## Mean :0.0001287 Mean :35.65
## 3rd Qu.:0.0000000 3rd Qu.:39.20
## Max. :1.0000000 Max. :73.60
## NA's :71835
## X_pressurem X_rain X_snow X_tempm
## Min. : -9999 Min. :0.00000 Min. :0.0e+00 Min. : 1.00
## 1st Qu.: 1002 1st Qu.:0.00000 1st Qu.:0.0e+00 1st Qu.:19.00
## Median : 1008 Median :0.00000 Median :0.0e+00 Median :27.00
## Mean : 1932 Mean :0.02626 Mean :9.9e-06 Mean :25.45
## 3rd Qu.: 1014 3rd Qu.:0.00000 3rd Qu.:0.0e+00 3rd Qu.:32.00
## Max. :101061443 Max. :1.00000 Max. :1.0e+00 Max. :90.00
## NA's :232 NA's :673
## X_thunder X_tornado X_vism X_wdird
## Min. :0.000000 Min. :0.00e+00 Min. : 0.000 Min. : 0.0
## 1st Qu.:0.000000 1st Qu.:0.00e+00 1st Qu.: 1.500 1st Qu.: 50.0
## Median :0.000000 Median :0.00e+00 Median : 2.000 Median :160.0
## Mean :0.009427 Mean :1.98e-05 Mean : 2.403 Mean :163.6
## 3rd Qu.:0.000000 3rd Qu.:0.00e+00 3rd Qu.: 3.000 3rd Qu.:270.0
## Max. :1.000000 Max. :1.00e+00 Max. :6436.000 Max. :960.0
## NA's :4428 NA's :14755
## X_wdire X_wgustm X_windchillm X_wspdm
## Length:100990 Min. :25.90 Min. :2.10 Min. : 0.000
## Class :character 1st Qu.:33.30 1st Qu.:4.90 1st Qu.: 0.000
## Mode :character Median :37.00 Median :6.10 Median : 7.400
## Mean :37.67 Mean :5.71 Mean : 7.678
## 3rd Qu.:40.70 3rd Qu.:6.80 3rd Qu.: 11.100
## Max. :92.60 Max. :7.30 Max. :1514.900
## NA's :99918 NA's :100411 NA's :2358
Missing Values
| Column Name | Number of NA’s | % of NA’s |
|---|---|---|
| X_dewptm | 621 | .61% |
| X_heatindexm | 71835 | 71% |
| X_precipm | 100990 | 100% |
| X_pressurem | 232 | .23% |
| X_tempm | 673 | .66% |
| X_vism | 4428 | 4.4% |
| X_wdird | 14755 | 14.6% |
| X_wgustm | 99918 | 98.9% |
| X_windchillm | 100411 | 99.4% |
| X_wspdm | 2358 | 2.33% |
While temperature of the city (X_tempm) is #1 point of interest in this exercise, from summary of the dataset, we found maximum number is 90 in the variable X_tempm. We need to diagose if this number is correct or not?
## datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm
## 91392 20150624-03:00 Patches of Fog 25 1 0 NA
## X_hum X_precipm X_pressurem X_rain X_snow X_tempm X_thunder
## 91392 4 NA 997 0 0 90 0
## X_tornado X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 91392 0 0.5 NA NA NA 0
In the temperature column, the temperature numbers were in Celsius. If we think 90 is in Celsius, this is not possible for a temperature in New Delhi. We checked other time on Jun 24 to see how tempeature look like
## datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 91393 20150624-06:00 Smoke 26 0 0 NA 56
## X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 91393 NA 997 0 0 34 0 0
## X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 91393 1 50 NE NA NA 9.3
After we checked the temperature on Jun 24 at 6:00am was 34 in this dataset, we decided that it was reasonable to assume that 90 was in Fahrenheit. Therefore, we changed it to Celsius which is 32 Celsius.
## datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm
## 91392 20150624-03:00 Patches of Fog 25 1 0 NA
## X_hum X_precipm X_pressurem X_rain X_snow X_tempm X_thunder
## 91392 4 NA 997 0 0 32 0
## X_tornado X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 91392 0 0.5 NA NA NA 0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 19.00 27.00 25.45 32.00 72.00 673
## datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 11085 19981125-01:30 Smoke 6 0 0 NA 4
## X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 11085 NA 1015 0 0 72 0 0
## X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 11085 2.8 260 West NA NA 3.7
We assume that the 72 was also in Fahrenheit. We will convert it to Celsius which will be 22.
## datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 11085 19981125-01:30 Smoke 6 0 0 NA 4
## X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 11085 NA 1015 0 0 22 0 0
## X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 11085 2.8 260 West NA NA 3.7
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 19.00 27.00 25.45 32.00 63.00 673
## datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 12952 19990529-11:30 Haze 7 0 0 NA 4
## X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 12952 NA 995 0 0 63 0 0
## X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 12952 4.5 300 WNW NA NA 18.5
The 63 could be a typo as temperature in May in New Delhi usually around 40 / 20(high/low). So we check other time on May 29 to see what the temperature recorded in the dataset. We found the temperature at 12:30 is 42, so we will replace the 63 by 42.
## datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 12953 19990529-12:30 Haze 7 0 0 NA 12
## X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 12953 NA 994 0 0 42 0 0
## X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 12953 4 270 West NA NA 14.8
## datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 12952 19990529-11:30 Haze 7 0 0 NA 4
## X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 12952 NA 995 0 0 42 0 0
## X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 12952 4.5 300 WNW NA NA 18.5
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 19.00 27.00 25.45 32.00 62.00 673
## datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 68078 20061231-06:00 Smoke 9 0 0 NA 4
## X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 68078 NA 1018 0 0 62 0 0
## X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 68078 1 180 South NA NA 7.4
## datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 68077 20061231-03:00 Partial Fog 8 1 0 NA 92
## X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 68077 NA 1017 0 0 9 0 0
## X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 68077 0.2 NA NA NA 0
## datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 68079 20061231-09:00 Smoke 14 0 0 NA 59
## X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 68079 NA 1014 0 0 21 0 0
## X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 68079 1 230 SW NA NA 1.9
After we checked one tempeature before and after 6:00AM, we belive the 62 was in Fahrenheit, therefore, we will change it to 17 in Celsius.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 19.00 27.00 25.45 32.00 47.00 673
We checked tempeature history of New Delhi online and found that these days show above had high tempearture around 47. Since there were more than 10 observations, we will assume that it is reasonale and proceed without further value adjustments.
Replaceing Missing data: mean imputation method
## [1] 25.44953
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 27.00 25.45 32.00 47.00
There is no more NA’s.
## [1] "integer"
## datetime_utc X_hum X_tempm
## Length:100990 Min. : 4.00 Min. : 1.00
## Class :character 1st Qu.: 39.00 1st Qu.:19.00
## Mode :character Median : 59.00 Median :27.00
## Mean : 57.91 Mean :25.45
## 3rd Qu.: 78.00 3rd Qu.:32.00
## Max. :243.00 Max. :47.00
## NA's :757
When we convert the character to integer, NA’s introduced by coercion. Therefore, we will need to deal NA’s in variable X_hum. We also need to check if the maximum number 243 is reasonable number?
## datetime_utc X_hum X_tempm
## 95249 20160607-03:00 243 32
## datetime_utc X_hum X_tempm
## 95248 20160607-02:00 66 31
## datetime_utc X_hum X_tempm
## 95250 20160607-04:00 56 34
We also checked online at link: [https://www.wunderground.com/history/daily/in/new-delhi/VIDP/date/2016-6-7]
So the 243 is an error number. We will replace it with the number 55 which is from online.
Summary the X_hum to see if all numbers make sense.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.00 39.00 59.00 57.91 78.00 225.00 757
## datetime_utc X_hum X_tempm
## 61571 20040929-15:00 225 27
## datetime_utc X_hum X_tempm
## 61570 20040929-12:00 42 31
## datetime_utc X_hum X_tempm
## 61572 20040929-18:00 77 26
We also checked online at link: [https://www.wunderground.com/history/daily/in/new-delhi/VIDP/date/2004-9-29]
So the 225 is an error number. We will replace it with the number 43 which is from online.
Summary the X_hum to see if all numbers make sense
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.00 39.00 59.00 57.91 78.00 135.00 757
## datetime_utc X_hum X_tempm
## 73297 20081027-21:00 135 20
## datetime_utc X_hum X_tempm
## 73296 20081027-18:00 55 21
There is no one after 21:00 on that day. We checked online at link: [https://www.wunderground.com/history/daily/in/new-delhi/VIDP/date/2008-10-27] and found the humidity number is 57 for 20081027-21:00. So, we will replace the 135 with 57.
Summary the X_hum to see if all numbers make sense
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.00 39.00 59.00 57.91 78.00 100.00 757
Now we see the maximum humidity number is 100. Is it possible? Surprisingly, yes, the condition is known as supersaturation. At any given temperature and air pressure, a specific maximum amount of water vapor in the air will produce a relative humidity (RH) of 100 percent
## [1] 57.90501
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 39.00 58.00 57.91 77.00 100.00
## [1] "character"
## [1] "Date"
## [1] "1996-11-01" "1996-11-01" "1996-11-01" "1996-11-01" "1996-11-01"
## [6] "1996-11-01"
## [1] "2017-04-24" "2017-04-24" "2017-04-24" "2017-04-24" "2017-04-24"
## [6] "2017-04-24"
As time series analysis requires a series of equally disperse time points, the dataset is aggregated by date by mean, in order to have one observation per day. Then use ggplot to visualize the daily temperature data
The new data set is plotted for review. The dataset seems pretty stationary with seasonality.
Now, the dataset will be decomposed for further analysis. Not a significant trend exits but seasonality is definitely there.
Now the dataset’s stationarity is being check to see whether this is suitable for ARIMA model. And the test results indicate that this is stationary.
##
## Augmented Dickey-Fuller Test
##
## data: tstemp
## Dickey-Fuller = -5.013, Lag order = 19, p-value = 0.01
## alternative hypothesis: stationary
While running ACF and PACF to help find p, d, q value, it is noticed that there are too many observations in the dataset.
For now, an ARIMA model is being developed with auto. arima function with existing dateset and its result is being reviewed
## Series: tstemp
## ARIMA(2,0,1)(0,1,0)[365]
##
## Coefficients:
## ar1 ar2 ma1
## 1.6188 -0.6255 -0.9008
## s.e. 0.0171 0.0162 0.0113
##
## sigma^2 estimated as 5.459: log likelihood=-15813.33
## AIC=31634.67 AICc=31634.67 BIC=31662.07
Although previous model(daily) produces a reasonable model, it seems to have some readability issue in test results, in addition to computation time. We are aggregating the dataset to monthly mean to develop another model.
We notice that curves are smoother and the dataset is still more or less stationary.
##
## Augmented Dickey-Fuller Test
##
## data: tstemp2
## Dickey-Fuller = -16.673, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
For ACF and PACF tests, they are conceptually the same as daily dataset’s but gotten easier to read.
## Series: tstemp2
## ARIMA(3,0,3)(0,1,1)[12] with drift
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 sma1 drift
## 0.6128 0.2931 -0.3083 -0.2902 -0.2260 0.0685 -0.6233 0.0059
## s.e. 0.3151 0.3544 0.2534 0.3212 0.3085 0.2208 0.0816 0.0059
##
## sigma^2 estimated as 3.561: log likelihood=-477.55
## AIC=973.09 AICc=973.9 BIC=1004.15
Since there are lags over level of significance still being noticed, an ARIMA model with manual order will be tried to improve accuracy.
##
## Call:
## arima(x = tstemp2, order = c(10, 0, 10), seasonal = list(order = c(0, 1, 1)))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ar8
## 0.6505 -0.0683 0.2018 -0.3807 0.1023 -0.0680 0.3535 -0.5979
## s.e. 0.8356 0.8701 0.4213 0.3477 0.4130 0.2597 0.2062 0.3860
## ar9 ar10 ma1 ma2 ma3 ma4 ma5 ma6
## 0.3133 0.3496 -0.3211 0.1346 -0.2757 0.3811 -0.0098 0.2814
## s.e. 0.6835 0.6089 0.8498 0.5892 0.3464 0.3504 0.4066 0.1962
## ma7 ma8 ma9 ma10 sma1
## -0.4703 0.4528 -0.4411 -0.2669 -0.8459
## s.e. 0.3184 0.4971 0.5960 0.6385 0.0732
##
## sigma^2 estimated as 2.592: log likelihood = -453.99, aic = 951.98
Since the model seems to have better accuracy, now the model is being tested for its forecast accuracy with actual value. For this, first 200 values are being used to train the model and remaining actual 45 values are being compared to forecast.
Although it is more or less redundant, as humidity also plays a big role in Summer weather condition, another ARIMA model is being built for a humidity prediction model.
##
## Augmented Dickey-Fuller Test
##
## data: tshum
## Dickey-Fuller = -8.0112, Lag order = 19, p-value = 0.01
## alternative hypothesis: stationary
##
## Augmented Dickey-Fuller Test
##
## data: tshum2
## Dickey-Fuller = -8.0591, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
## Series: tshum2
## ARIMA(3,0,2)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sma1
## 0.5360 0.5231 -0.2334 -0.1125 -0.5243 -0.7908
## s.e. 0.5621 0.4317 0.1965 0.5531 0.3808 0.0681
##
## sigma^2 estimated as 76.76: log likelihood=-839.22
## AIC=1692.44 AICc=1692.94 BIC=1716.6
##
## Call:
## arima(x = tshum2, order = c(11, 0, 11), seasonal = list(order = c(0, 1, 1)))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ar8
## -0.2813 -0.3619 -0.0340 0.1810 -0.0844 0.3095 0.3501 0.3623
## s.e. NaN 0.5051 0.5055 0.4067 0.4509 0.4025 NaN 0.1806
## ar9 ar10 ar11 ma1 ma2 ma3 ma4 ma5
## 0.1164 -0.0378 -0.2439 0.7277 0.7589 0.3836 0.1574 0.2916
## s.e. NaN 0.2477 0.1287 NaN 0.4767 0.6668 0.8873 0.8566
## ma6 ma7 ma8 ma9 ma10 ma11 sma1
## -0.1127 -0.3389 -0.6567 -0.567 -0.4270 0.2227 -0.7122
## s.e. 0.8761 0.5208 0.3486 NaN 0.2873 NaN 0.0944
##
## sigma^2 estimated as 60.17: log likelihood = -819.34, aic = 1686.67
Due to time contraint and lack of understanding in all of time series model algorithms, this project has only demonstrated the usage of ARIMA model. However, both Auto ARIMA and Manual ARIMA are tested to determine a model with better accuracy, ACF and PACF results are examined to utilized manual ARIMA fitting.
Weather data from 1996 to 2017 is used to build the model, however, 2032 Summer Olympics is 13 years away from now which is very distant from now. It is realized that the current model might not stay completely feasible for 2032 prediction and will require further data and model re-fitting in future. Although there is not a significant trend in the data now, it may develop a trend due to global climate change. In addition, it is possible that it can develop a strong correlation among different variables.
However, it is believed that there still is reasonable value in the built model. From Summer Olympics precedents, climate from past events can be compared to different months of the city, then when reasonable target months are decided, that information can further develop other economical and political case studies to test the hosting city’s feasibility.
ShinyAPP for this model is developed to generate predicted termperature and humidity value of future months. Maximum 240 months(20 years) prediction is possible, which covers up to year 2037, while 2032 is main focus. The predicted value can be seen in both table and graph format. In addition, it also enables users to review past temperature and humidity record by year and month so they can intuitively get understanding in the city’s climate. While all 22 years’ record is available, a slider control is placed to shorten the range if certain period needs to be focused.